AITopics | accelerator design

accelerator design

SparseMap: A Sparse Tensor Accelerator Framework Based on Evolution Strategy

Zhao, Boran, Zhai, Haiming, Yuan, Zihang, Liu, Hetian, Xia, Tian, Zhao, Wenzhe, Ren, Pengju

arXiv.org Artificial Intelligence, Aug-19-2025

The growing demand for sparse tensor algebra (SpTA) in machine learning and big data has driven the development of various sparse tensor accelerators. However, most existing manually designed accelerators are limited to specific scenarios, and it's time-consuming and challenging to adjust a large number of design factors when scenarios change. Therefore, automating the design of SpTA accelerators is crucial. Nevertheless, previous works focus solely on either mapping (i.e., tiling communication and computation in space and time) or sparse strategy (i.e., bypassing zero elements for efficiency), leading to suboptimal designs due to the lack of comprehensive consideration of both. A unified framework that jointly optimizes both is urgently needed. However, integrating mapping and sparse strategies leads to a combinatorial explosion in the design space (e.g., as large as $O(10^{41})$ for the workload $P_{32 \times 64} \times Q_{64 \times 48} = Z_{32 \times 48}$). This vast search space renders most conventional optimization methods (e.g., particle swarm optimization, reinforcement learning and Monte Carlo tree search) inefficient. To address this challenge, we propose an evolution strategy-based sparse tensor accelerator optimization framework, called SparseMap. SparseMap constructs a more comprehensive design space that considers both mapping and sparse strategy. We introduce a series of enhancements to genetic encoding and evolutionary operators, enabling SparseMap to efficiently explore the vast and diverse design space. We quantitatively compare SparseMap with prior works and classical optimization methods, demonstrating that SparseMap consistently finds superior solutions.
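As a rough illustration of what an evolution strategy over a joint mapping and sparse-strategy space can look like, here is a minimal sketch. It is not SparseMap's encoding, operators, or cost model: the genome layout, the `STRATEGIES` options, and the `cost` function are all assumptions made up for this example. Only the workload dimensions (32, 64, 48) come from the abstract.

```python
import random

# Workload dimensions from the abstract's example: P(32x64) x Q(64x48) = Z(32x48).
DIMS = {"M": 32, "K": 64, "N": 48}
# Hypothetical sparse-strategy choices; not SparseMap's actual option set.
STRATEGIES = ("none", "gating", "skipping")


def divisors(n):
    """All positive divisors of n, used as legal tile sizes."""
    return [d for d in range(1, n + 1) if n % d == 0]


TILE_CHOICES = {name: divisors(size) for name, size in DIMS.items()}


def random_genome(rng):
    """A genome is one tile size per dimension plus one sparse strategy."""
    genome = {name: rng.choice(choices) for name, choices in TILE_CHOICES.items()}
    genome["strategy"] = rng.choice(STRATEGIES)
    return genome


def mutate(genome, rng):
    """Resample exactly one gene, leaving the parent untouched."""
    child = dict(genome)
    gene = rng.choice(list(child))
    if gene == "strategy":
        child[gene] = rng.choice(STRATEGIES)
    else:
        child[gene] = rng.choice(TILE_CHOICES[gene])
    return child


def cost(genome):
    """Made-up cost: penalise tiles far from 8 and prefer skipping zeros."""
    tile_penalty = sum(abs(genome[name] - 8) for name in DIMS)
    strategy_penalty = {"none": 10, "gating": 4, "skipping": 0}[genome["strategy"]]
    return tile_penalty + strategy_penalty


def evolve(generations=60, mu=8, lam=24, seed=0):
    """(mu + lambda) selection: parents compete with their offspring."""
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(mu)]
    for _ in range(generations):
        offspring = [mutate(rng.choice(population), rng) for _ in range(lam)]
        population = sorted(population + offspring, key=cost)[:mu]
    return population[0]


best = evolve()
print(best, cost(best))
```

Because parents survive alongside offspring, the best cost never gets worse from one generation to the next. A real framework would replace `cost` with an accelerator performance model and would need a far richer genome than three tile sizes and a flag.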

artificial intelligence, evolutionary algorithm, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2508.12906

Country: Asia > China (0.28)

Genre: Research Report (0.64)

Industry:

Information Technology (0.47)
Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Designing Efficient LLM Accelerators for Edge Devices

Haris, Jude, Saha, Rappy, Hu, Wenhao, Cano, José

arXiv.org Artificial Intelligence, Aug-1-2024

The increase in open-source availability of Large Language Models (LLMs) has enabled users to deploy them on more and more resource-constrained edge devices to reduce reliance on network connections and provide more privacy. However, the high computation and memory demands of LLMs make their execution on resource-constrained edge devices challenging and inefficient. To address this issue, designing new and efficient edge accelerators for LLM inference is crucial. FPGA-based accelerators are ideal for LLM acceleration due to their reconfigurability, as they enable model-specific optimizations and higher performance per watt. However, creating and integrating FPGA-based accelerators for LLMs (particularly on edge devices) has proven challenging, mainly due to the limited hardware design flows for LLMs in existing FPGA platforms. To tackle this issue, in this paper we first propose a new design platform, named SECDA-LLM, that utilizes the SECDA methodology to streamline the process of designing, integrating, and deploying efficient FPGA-based LLM accelerators for the llama.cpp inference framework. We then demonstrate, through a case study, the potential benefits of SECDA-LLM by creating a new MatMul accelerator that supports block floating point quantized operations for LLMs. Our initial accelerator design, deployed on the PYNQ-Z1 board, reduces latency (1.7 seconds per token or ~2 seconds per word) by 11x over the dual-core Arm NEON-based CPU execution for the TinyLlama model.
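For readers unfamiliar with block floating point, the sketch below shows the general idea: values in a block share one exponent and keep small integer mantissas, so a dot product reduces to integer multiply-accumulates plus one final scaling. This is a generic illustration, not the quantization format used by SECDA-LLM or llama.cpp; the function names `bfp_quantize` and `bfp_dot` and the 8-bit mantissa width are assumptions.

```python
import math


def bfp_quantize(block, mantissa_bits=8):
    """Quantise a block to signed integer mantissas sharing one exponent."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return [0] * len(block), 0
    limit = 2 ** (mantissa_bits - 1) - 1  # 127 for 8-bit mantissas
    # Smallest power-of-two scale that keeps every mantissa within +/- limit.
    exponent = math.ceil(math.log2(max_abs / limit))
    scale = 2.0 ** exponent
    # Clamp guards against floating-point rounding at the boundary.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in block]
    return mantissas, exponent


def bfp_dot(a_mant, a_exp, b_mant, b_exp):
    """Integer multiply-accumulate, then a single scaling by both exponents."""
    acc = sum(x * y for x, y in zip(a_mant, b_mant))
    return acc * 2.0 ** (a_exp + b_exp)


a = [0.5, -1.25, 3.0, 0.0]
b = [2.0, 0.25, -1.0, 4.0]
a_mant, a_exp = bfp_quantize(a)
b_mant, b_exp = bfp_quantize(b)
print(bfp_dot(a_mant, a_exp, b_mant, b_exp))
```

The inner loop touches only integers, which is the property that makes the scheme attractive for a hardware MatMul unit; the floating-point work is confined to one multiplication per block pair.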

accelerator, llama, llm, (16 more...)

arXiv.org Artificial Intelligence

2408.00462

Country:

North America > United States > New York (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)
Europe > United Kingdom > Scotland > City of Glasgow > Glasgow (0.04)
Asia > South Korea > Incheon > Incheon (0.04)

Genre: Research Report (0.42)

Industry: Information Technology (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

The Magnificent Seven Challenges and Opportunities in Domain-Specific Accelerator Design for Autonomous Systems

Neuman, Sabrina M., Plancher, Brian, Reddi, Vijay Janapa

arXiv.org Artificial Intelligence, Jul-24-2024

The end of Moore's Law and Dennard Scaling has combined with advances in agile hardware design to foster a golden age of domain-specific acceleration. However, this new frontier of computing opportunities is not without pitfalls. As computer architects approach unfamiliar domains, we have seen common themes emerge in the challenges that can hinder progress in the development of useful acceleration. In this work, we present the Magnificent Seven Challenges in domain-specific accelerator design that can guide adventurous architects to contribute meaningfully to novel application domains. Although these challenges appear across domains ranging from ML to genomics, we examine them through the lens of autonomous systems as a motivating example in this work. To that end, we identify opportunities for the path forward in a successful domain-specific accelerator design from these challenges.

accelerator, artificial intelligence, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2407.17311

Country:

North America > United States > California > San Francisco County > San Francisco (0.16)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.49)
Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.65)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

HASS: Hardware-Aware Sparsity Search for Dataflow DNN Accelerator

Yu, Zhewen, Sreeram, Sudarshan, Agrawal, Krish, Wu, Junyi, Montgomerie-Corcoran, Alexander, Zhang, Cheng, Cheng, Jianyi, Bouganis, Christos-Savvas, Zhao, Yiren

arXiv.org Artificial Intelligence, Jun-5-2024

Deep Neural Networks (DNNs) excel in learning hierarchical representations from raw data, such as images, audio, and text. To compute these DNN models with high performance and energy efficiency, these models are usually deployed onto customized hardware accelerators. Among various accelerator designs, dataflow architecture has shown promising performance due to its layer-pipelined structure and its scalability in data parallelism. Exploiting weights and activations sparsity can further enhance memory storage and computation efficiency. However, existing approaches focus on exploiting sparsity in non-dataflow accelerators, which cannot be applied onto dataflow accelerators because of the large hardware design space introduced. As such, this could miss opportunities to find an optimal combination of sparsity features and hardware designs. In this paper, we propose a novel approach to exploit unstructured weights and activations sparsity for dataflow accelerators, using software and hardware co-optimization. We propose a Hardware-Aware Sparsity Search (HASS) to systematically determine an efficient sparsity solution for dataflow accelerators. Over a set of models, we achieve an efficiency improvement ranging from 1.3× to 4.2× compared to existing sparse designs, which are either non-dataflow or non-hardware-aware. Particularly, the throughput of MobileNetV3 can be optimized to 4895 images per second. HASS is open-source: https://github.com/Yu-Zhewen/HASS

accelerator, computation, sparsity, (17 more...)

arXiv.org Artificial Intelligence

2406.03088

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

GPT4AIGChip: Towards Next-Generation AI Accelerator Design Automation via Large Language Models

Fu, Yonggan, Zhang, Yongan, Yu, Zhongzhi, Li, Sixu, Ye, Zhifan, Li, Chaojian, Wan, Cheng, Lin, Yingyan

arXiv.org Artificial Intelligence, Sep-19-2023

The remarkable capabilities and intricate nature of Artificial Intelligence (AI) have dramatically escalated the imperative for specialized AI accelerators. Nonetheless, designing these accelerators for various AI workloads remains both labor- and time-intensive. While existing design exploration and automation tools can partially alleviate the need for extensive human involvement, they still demand substantial hardware expertise, posing a barrier to non-experts and stifling AI accelerator development. Motivated by the astonishing potential of large language models (LLMs) for generating high-quality content in response to human language instructions, we embark on this work to examine the possibility of harnessing LLMs to automate AI accelerator design. Through this endeavor, we develop GPT4AIGChip, a framework intended to democratize AI accelerator design by leveraging human natural languages instead of domain-specific languages. Specifically, we first perform an in-depth investigation into LLMs' limitations and capabilities for AI accelerator design, thus aiding our understanding of our current position and garnering insights into LLM-powered automated AI accelerator design. Furthermore, drawing inspiration from the above insights, we develop a framework called GPT4AIGChip, which features an automated demo-augmented prompt-generation pipeline utilizing in-context learning to guide LLMs towards creating high-quality AI accelerator design. To our knowledge, this work is the first to demonstrate an effective pipeline for LLM-powered automated AI accelerator generation. Accordingly, we anticipate that our insights and framework can serve as a catalyst for innovations in next-generation LLM-powered design automation tools.

accelerator design, demonstration, llm, (12 more...)

arXiv.org Artificial Intelligence

2309.1073

Country: North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

SATAY: A Streaming Architecture Toolflow for Accelerating YOLO Models on FPGA Devices

Montgomerie-Corcoran, Alexander, Toupas, Petros, Yu, Zhewen, Bouganis, Christos-Savvas

arXiv.org Artificial Intelligence, Sep-4-2023

AI has led to significant advancements in computer vision and image processing tasks, enabling a wide range of applications in real-life scenarios, from autonomous vehicles to medical imaging. Many of those applications require efficient object detection algorithms and complementary real-time, low-latency hardware to perform inference of these algorithms. The YOLO family of models is considered the most efficient for object detection, having only a single model pass. Despite this, the complexity and size of YOLO models can be too computationally demanding for current edge-based platforms. To address this, we present SATAY: a Streaming Architecture Toolflow for Accelerating YOLO. This work tackles the challenges of deploying state-of-the-art object detection models onto FPGA devices for ultra-low latency applications, enabling real-time, edge-based object detection. We employ a streaming architecture design for our YOLO accelerators, implementing the complete model on-chip in a deeply pipelined fashion. These accelerators are generated using an automated toolflow, and can target a range of suitable FPGA devices. We introduce novel hardware components to support the operations of YOLO models in a dataflow manner, and off-chip memory buffering to address the limited on-chip memory resources. Our toolflow is able to generate accelerator designs which demonstrate competitive performance and energy characteristics to GPU devices, and which outperform current state-of-the-art FPGA accelerators.

accelerator, architecture, detection, (17 more...)

arXiv.org Artificial Intelligence

2309.01587

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.34)

MARS: Exploiting Multi-Level Parallelism for DNN Workloads on Adaptive Multi-Accelerator Systems

Shen, Guan, Zhao, Jieru, Wang, Zeke, Lin, Zhe, Ding, Wenchao, Wu, Chentao, Chen, Quan, Guo, Minyi

arXiv.org Artificial Intelligence, Jul-23-2023

Along with the fast evolution of deep neural networks, the hardware system is also developing rapidly. As a promising solution achieving high scalability and low manufacturing cost, multi-accelerator systems widely exist in data centers, cloud platforms, and SoCs. Thus, a challenging problem arises in multi-accelerator systems: selecting a proper combination of accelerators from available designs and searching for efficient DNN mapping strategies. To this end, we propose MARS, a novel mapping framework that can perform computation-aware accelerator selection, and apply communication-aware sharding strategies to maximize parallelism. Experimental results show that MARS can achieve 32.2% latency reduction on average for typical DNN workloads compared to the baseline, and 59.4% latency reduction on heterogeneous models compared to the corresponding state-of-the-art method.

accelerator, artificial intelligence, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2307.12234

Country: Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > Promising Solution (0.54)

Industry: Information Technology > Services (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Weight Fixing Networks

Subia-Waud, Christopher, Dasmahapatra, Srinandan

arXiv.org Artificial Intelligence, Oct-24-2022

Modern iterations of deep learning models contain millions (billions) of unique parameters, each represented by a b-bit number. Popular attempts at compressing neural networks (such as pruning and quantisation) have shown that many of the parameters are superfluous, which we can remove (pruning) or express with fewer than b bits (quantisation) without hindering performance. Here we look to go much further in minimising the information content of networks. Rather than a channel or layer-wise encoding, we look to lossless whole-network quantisation to minimise the entropy and number of unique parameters in a network. We propose a new method, which we call Weight Fixing Networks (WFN), that we design to realise four model outcome objectives: i) very few unique weights, ii) low-entropy weight encodings, iii) unique weight values which are amenable to energy-saving versions of hardware multiplication, and iv) lossless task-performance. Some of these goals are conflicting. To best balance these conflicts, we combine a few novel (and some well-trodden) tricks: a novel regularisation term (i, ii), a view of clustering cost as relative distance change (i, ii, iv), and a focus on whole-network re-use of weights (i, iii). Our Imagenet experiments demonstrate lossless compression using 56x fewer unique weights and a 1.9x lower weight-space entropy than SOTA quantisation approaches.
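To make the two headline metrics concrete, the sketch below counts unique weight values and computes the Shannon entropy of the empirical weight distribution, before and after snapping weights to a small shared codebook. It is only an illustration of the metrics, not the WFN method: the nearest-value snapping, the `weights` list, and the `codebook` (chosen here to include powers of two) are assumptions, and the abstract does not specify how its entropy figure is computed.

```python
import math
from collections import Counter


def weight_entropy(weights):
    """Shannon entropy (bits) of the empirical distribution of weight values."""
    counts = Counter(weights)
    total = len(weights)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def snap_to_codebook(weights, codebook):
    """Replace each weight with the nearest value in a network-wide codebook."""
    return [min(codebook, key=lambda c: abs(c - w)) for w in weights]


# Hypothetical weights and codebook, for illustration only.
weights = [0.24, 0.26, 0.49, 0.51, -0.13, -0.12, 0.0, 0.01]
codebook = [-0.125, 0.0, 0.25, 0.5]

fixed = snap_to_codebook(weights, codebook)
print(len(set(weights)), weight_entropy(weights))  # before snapping
print(len(set(fixed)), weight_entropy(fixed))      # after snapping
```

In this toy case the eight distinct weights collapse to four shared values and the entropy drops from 3 bits to 2 bits. Keeping task performance lossless while doing so is the hard part that the paper's regularisation and clustering-cost ideas address, and which this sketch does not attempt.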

artificial intelligence, cluster centre, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2210.13554

Country: Europe > United Kingdom (0.04)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

DANCE: Differentiable Accelerator/Network Co-Exploration

Choi, Kanghyun, Hong, Deokki, Yoon, Hojae, Yu, Joonsang, Kim, Youngsok, Lee, Jinho

arXiv.org Artificial Intelligence, Sep-14-2020

To cope with the ever-increasing computational demand of DNN execution, recent neural architecture search (NAS) algorithms take hardware cost metrics into account, such as GPU latency. To further pursue fast, efficient execution, DNN-specialized hardware accelerators are being designed for multiple purposes, which far exceed the efficiency of GPUs. However, those hardware-related metrics have been proven to exhibit non-linear relationships with the network architectures. Therefore it became a chicken-and-egg problem to optimize the network against the accelerator, or to optimize the accelerator against the network. In such circumstances, this work presents DANCE, a differentiable approach towards the co-exploration of the hardware accelerator and network architecture design. At the heart of DANCE is a differentiable evaluator network. By modeling the hardware evaluation software with a neural network, the relation between the accelerator architecture and the hardware metrics becomes differentiable, allowing the search to be performed with backpropagation. Compared to the naive existing approaches, our method performs co-exploration in a significantly shorter time, while achieving superior accuracy and hardware cost metrics.

accelerator, artificial intelligence, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2009.06237

Country: Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report (0.40)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Co-Exploration of Neural Architectures and Heterogeneous ASIC Accelerator Designs Targeting Multiple Tasks

Yang, Lei, Yan, Zheyu, Li, Meng, Kwon, Hyoukjun, Lai, Liangzhen, Krishna, Tushar, Chandra, Vikas, Jiang, Weiwen, Shi, Yiyu

arXiv.org Machine Learning, Feb-10-2020

Neural Architecture Search (NAS) has demonstrated its power on various AI accelerating platforms such as Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). However, how to integrate NAS with Application-Specific Integrated Circuits (ASICs) remains an open problem, despite them being the most powerful AI accelerating platforms. The major bottleneck comes from the large design freedom associated with ASIC designs. Moreover, with the consideration that multiple DNNs will run in parallel for different workloads with diverse layer operations and sizes, integrating heterogeneous ASIC sub-accelerators for distinct DNNs in one design can significantly boost performance, and at the same time further complicate the design space. To address these challenges, in this paper we build an ASIC template set based on existing successful designs, described by their unique dataflows, so that the design space is significantly reduced. Based on the templates, we further propose a framework, namely NASAIC, which can simultaneously identify multiple DNN architectures and the associated heterogeneous ASIC accelerator design, such that the design specifications (specs) can be satisfied, while the accuracy can be maximized. Experimental results show that compared with successive NAS and ASIC design optimizations which lead to design spec violations, NASAIC can guarantee the results to meet the design specs with 17.77%, 2.49x, and 2.32x reductions on latency, energy, and area and with 0.76% accuracy loss. To the best of the authors' knowledge, this is the first work on neural architecture and ASIC accelerator design co-exploration.

architecture, design spec, neural architecture, (16 more...)

arXiv.org Machine Learning

2002.04116

Country: North America > United States > Nevada > Clark County > Las Vegas (0.04)

Genre: Research Report (0.69)

Industry: Semiconductors & Electronics (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
